Voice

Druid enables AI Agent voice capabilities to meet the demand for hands-free, conversational interactions across various business scenarios. This allows users to communicate with AI Agents naturally through two primary modes:

  • Telephony. Users can interact with AI Agents via traditional phone lines. This is ideal for automating call center triage before agent hand-off, or providing automated HR and IT Help Desk support through a dedicated phone number and telephone exchange.
  • Voice intranet page. Users can use voice commands directly within a web interface. For example, a user can verbally instruct an AI Agent to perform tasks or edit documents while working within an intranet page.

The voice channel is currently available as a technology preview via the Druid web snippet. You can configure and test voice conversations within the Druid Portal or on hosted web snippets.

How the Voice Channel works with the WebChat snippet

  1. Press the microphone button in the chat snippet to start talking with the AI Agent.

  HINT: If you don’t see the microphone icon, set up the speech provider and configure the Voice channel.

  2. Your voice is processed by the Speech-to-Text (STT) service as you speak. You will see the transcript in the input area. When you finish the sentence, the text is sent to the AI Agent.
  3. The AI Agent processes the text and responds with a text message. You will see the text response in the chat snippet, and the AI Agent will also speak the response to you.
  4. The spoken response is delivered by the Text-to-Speech (TTS) service.

  NOTE: In Flow authoring, each flow step has a dedicated Voice setting where you can customize a spoken response specifically for the Voice channel, distinct from the text response.
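The turn sequence above can be sketched in JavaScript. The functions `transcribe`, `runAgent`, and `synthesize` are hypothetical stand-ins for the STT service, the AI Agent, and the TTS service; they are not Druid APIs.

```javascript
// Sketch of one voice turn in the webchat snippet: STT -> AI Agent -> TTS.
// transcribe/runAgent/synthesize are illustrative stubs, not Druid APIs.
function transcribe(audio) {
  // Speech-to-Text: partial transcripts stream into the input area;
  // the final transcript is sent when the sentence completes.
  return audio.transcript;
}

function runAgent(text) {
  // The AI Agent answers with a text response and, if the flow step
  // defines a Voice setting, a distinct spoken response.
  return { text: `Echo: ${text}`, voice: `Echo: ${text}` };
}

function synthesize(spokenText) {
  // Text-to-Speech: returns an audio payload to play in the snippet.
  return { audio: `<audio for "${spokenText}">` };
}

function voiceTurn(audio) {
  const transcript = transcribe(audio);   // step 2: STT
  const reply = runAgent(transcript);     // step 3: agent responds
  const speech = synthesize(reply.voice); // step 4: TTS playback
  return { transcript, reply, speech };
}
```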

Set up the speech provider

Druid delivers Speech-to-Text (STT) and Text-to-Speech (TTS) functionality through integrations with industry-leading Technology Partners. Out-of-the-box support includes:

  • Microsoft Cognitive Services
  • ElevenLabs (TTS available starting with Druid 9.15 and STT available starting with Druid 9.18)
  • Deepgram (STT only)

To integrate a preferred speech provider not listed above, please contact Druid Tech Support.

Setting up the Microsoft Cognitive Service

IMPORTANT! To use the Voice channel in production environments, contact Druid Tech Support for the necessary keys.
  1. In the Druid Portal, go to your AI Agent settings.
  2. Select the AI & Cognitive Services category and click Microsoft Cognitive Service.
  3. Enter the Key and Region provided by the Druid Support Team in the voice channel activation email.

  HINT: For demo purposes, you can request a test key from Druid Tech Support.

  4. Map the languages your AI Agent supports to specific voices in the configuration table.
    1. In the table below the Voice channel details, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific voice the AI Agent will use to respond.
    4. Click the Save icon displayed inline.
  5. Click Save at the bottom of the page and close the modal.
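The language-to-voice mapping can be pictured as a simple lookup, sketched below. The voice names are real Azure neural voice identifiers, but the mapping and the fallback to the default language are illustrative assumptions, not Druid's actual implementation.

```javascript
// Illustrative language-to-voice mapping, mirroring the rows added in the
// Microsoft Cognitive Service configuration table (entries are examples).
const voiceMap = {
  'en-US': 'en-US-JennyNeural',
  'ro-RO': 'ro-RO-AlinaNeural',
};

// Resolve the voice used to speak a reply; falls back to the agent's
// default language when the user's language has no mapped voice.
function resolveVoice(language, defaultLanguage = 'en-US') {
  return voiceMap[language] ?? voiceMap[defaultLanguage];
}
```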

Setting up Deepgram

IMPORTANT! You can use Deepgram as a voice provider for Webchat in Druid 9.1 and higher, for Speech-to-Text (STT) only.

Prerequisites

  • You need a Deepgram API Key with Member Permissions. Refer to Deepgram documentation (Token-Based Authentication) for information on how to create a key with Member permissions.

Setup procedure

  1. In the Druid Portal, go to your AI Agent settings.
  2. Select the AI & Cognitive Services category and click Deepgram.
  3. Enter your Deepgram API Key.
  4. Map the languages your AI Agent supports to specific Deepgram models in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Model dropdown, select the specific Deepgram model the AI Agent will use to transcribe the user’s speech.
    4. Click the Save icon displayed inline.

    HINT: For Druid versions prior to 9.6, enter the Deepgram model name manually (e.g., nova-2-medical). See the Deepgram documentation for the complete list of available models.
  5. Click Save at the bottom of the page and close the modal.
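To illustrate where the API key and the mapped model/language end up, the sketch below builds a Deepgram transcription request. It follows Deepgram's public REST API (`/v1/listen` with `Token` authorization); the helper function itself is illustrative and not part of Druid.

```javascript
// Builds a Deepgram transcription request descriptor (sketch only; it
// shows how the API key, model, and language from the configuration
// table are carried on a Deepgram /v1/listen call).
function buildDeepgramRequest(apiKey, model, language) {
  const url = new URL('https://api.deepgram.com/v1/listen');
  url.searchParams.set('model', model);       // e.g. "nova-2-medical"
  url.searchParams.set('language', language); // e.g. "en"
  return {
    url: url.toString(),
    headers: { Authorization: `Token ${apiKey}` },
  };
}
```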

Setting up ElevenLabs

Druid supports ElevenLabs as a high-quality Text-to-Speech (TTS) and Speech-to-Text (STT) provider, enabling your AI Agent to communicate using specialized synthetic voices and custom voice clones.

NOTE: ElevenLabs is available as TTS provider starting with Druid 9.15 and as STT provider starting with Druid 9.18.

Prerequisites

  • You need an ElevenLabs API Key. To get an API key, go to https://elevenlabs.io/app/developers/api-keys and copy the key.
  • Make sure to grant the API Key Read permissions for the following endpoints:
    • Voices
    • Text to Speech
    • Speech to Speech
    • Speech to Text (for STT support)
    • Sound Effects
    • Audio Isolation

Setup procedure

  1. In the Druid Portal, go to your AI Agent settings.
  2. Select the AI & Cognitive Services category and click ElevenLabs.
  3. Enter your ElevenLabs API Key.
  4. Map the languages your AI Agent supports to specific ElevenLabs voices in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific ElevenLabs voice the AI Agent will use to respond. The model is automatically filled in after you select the voice.
    4. Click the Save icon displayed inline.

  5. Click Save at the bottom of the page and close the modal.
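The sketch below shows how the Voice and auto-filled Model columns map onto an ElevenLabs TTS call. It follows ElevenLabs' public REST API (`POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header); the helper is illustrative, not a Druid API.

```javascript
// Sketch of an ElevenLabs text-to-speech request for one mapped row:
// voiceId and modelId correspond to the Voice and Model columns above.
function buildElevenLabsTtsRequest(apiKey, voiceId, modelId, text) {
  return {
    method: 'POST',
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    headers: {
      'xi-api-key': apiKey,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ text, model_id: modelId }),
  };
}
```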

Configure the Voice channel

Once a speech provider is active, you must explicitly tell the Webchat channel to use these services:

  1. Select the Web & Email category and click the WebChat channel.
  2. Select the primary Speech-to-Text Provider. If you select a provider other than Azure, you should also select a Fallback Speech-to-Text Provider. The fallback provider is used automatically if the primary provider does not support the user’s language. Starting with Druid 9.18, both Azure and ElevenLabs are available as STT fallback providers.

  HINT: If Deepgram is set as the primary STT provider with Azure or ElevenLabs as fallback, and the user selects a language unsupported by Deepgram, the system falls back to Azure/ElevenLabs. Once the user switches back to a language that Deepgram supports, the system automatically returns to Deepgram (the primary provider) for the remainder of the session.

  3. Select the primary Text-to-Speech Provider. If you select ElevenLabs, you should also select Azure as the Fallback Text-to-Speech Provider. Azure is used automatically if ElevenLabs does not support the user’s language.
  4. Click Save at the bottom of the page and close the modal.

A microphone icon will automatically appear in the webchat snippet. This allows users to switch from text to voice conversations seamlessly, enabling natural vocal interaction with the AI Agent.
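The primary/fallback behavior described above can be sketched as a per-message provider selection. The supported-language sets are illustrative assumptions, not the providers' actual coverage.

```javascript
// Fallback selection sketch for the STT provider: Deepgram primary,
// Azure fallback. Language sets are illustrative examples only.
const sttSupport = {
  deepgram: new Set(['en', 'es', 'fr']),
  azure: new Set(['en', 'es', 'fr', 'ro', 'hu']),
};

// Re-evaluated for each message: when the user switches back to a
// language the primary supports, the primary takes over again.
function pickSttProvider(language, primary = 'deepgram', fallback = 'azure') {
  return sttSupport[primary].has(language) ? primary : fallback;
}
```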

How the Voice Channel works with SDL Real-time Machine Translation

If you use a translation service for real-time translation and activate the Voice channel, the AI Agent will play back the response in the user's language.

When activating SDL machine translation, you can choose when the translation is performed: at conversation time or authoring time. For more information, see Using Machine Translation.

Voice Channel with Conversation Time Translation

  1. The user speaks in Language A.
  2. Speech-to-Text (STT) is performed in Language A.
  3. The text is translated into the AI Agent default language.
  4. NLP is performed in the AI Agent default language.
  5. A response is generated in the AI Agent default language.
  6. The response is translated back into Language A.
  7. The AI Agent responds with text in Language A.
  8. The response text is converted into audio by the Text-to-Speech (TTS) service.
NOTE: When Conversation Time Translation is enabled for the AI Agent, the language selector in the webchat snippet is replaced by a non-selectable World icon. This indicates that the AI Agent can process languages beyond its natively authored set. The webchat snippet automatically adapts to the user's language code. If a user sends a voice message in a language different from the AI Agent's pre-configured languages, the snippet detects the change and sends the audio to the Speech-to-Text (STT) service in that specific language. This ensures the AI Agent can accurately transcribe and translate the user's input regardless of the initial language settings.
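The eight steps above can be sketched as a single translated turn. The `services` object stubs out STT, translation, NLP, and TTS; everything here is illustrative, not Druid's internal pipeline.

```javascript
// Stub services standing in for STT, machine translation, NLP, and TTS.
const services = {
  stt: (audio, lang) => audio.transcript,
  translate: (text, from, to) => (from === to ? text : `[${to}] ${text}`),
  nlp: (text, lang) => `reply to: ${text}`,
  tts: (text, lang) => `<audio:${text}>`,
};

// One conversation-time translation turn, following steps 1-8 above.
function conversationTimeTurn(audio, userLang, defaultLang, svc) {
  const userText = svc.stt(audio, userLang);                        // 2: STT in Language A
  const agentInput = svc.translate(userText, userLang, defaultLang); // 3: translate to default
  const agentReply = svc.nlp(agentInput, defaultLang);               // 4-5: NLP + response
  const userReply = svc.translate(agentReply, defaultLang, userLang); // 6: translate back
  return { text: userReply, audio: svc.tts(userReply, userLang) };   // 7-8: text + TTS
}
```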

Voice Channel with Authoring Time Translation

When using Authoring Time Translation, Druid translates the message written in the Voice setting of flow steps from the default AI Agent language to all additional languages.

  1. The user speaks in Language A (default or additional AI Agent language).
  2. Speech-to-Text (STT) is performed in Language A.
  3. NLP is performed in Language A.
  4. The AI Agent responds with text in Language A.
  5. The response text is converted into audio by the Text-to-Speech (TTS) service and spoken to the user.

Related Topics